Treebank-Based Acquisition of Multilingual Unification Grammar Resources
نویسندگان
چکیده
Deep unification(constraint-)based grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive. This problem can be exacerbated in multilingual broad-coverage grammar development scenarios. Cahill et al. (2002, 2004) and O’Donovan et al. (2004) present an automatic f-structure annotation-based methodology to acquire broad-coverage, deep, Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to cross-linguistically reuse the lexical extraction and parsing modules from O’Donovan et al. (2004) and Cahill et al. (2004), respectively. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus as well as against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve 81.79% f-score against the PARC 700, a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al. (2004), 74.6% against the 2000 held-out TIGER dependency structures and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003).
منابع مشابه
Treebank-Based Multilingual Unification-Grammar Development
Broad-coverage, deep unification grammar development is time-consuming and costly. This problem can be exacerbated in multilingual grammar development scenarios. Recently (Cahill et al., 2002) presented a treebank-based methodology to semi-automatically create broadcoverage, deep, unification grammar resources for English. In this paper we present a project which adapts this model to a multilin...
متن کاملTreebank-Based Acquisition of Chinese LFG Resources for Parsing and Generation
This thesis describes a treebank-based approach to automatically acquire robust, wide-coverage Lexical-Functional Grammar (LFG) resources for Chinese parsing and generation, which is part of a larger project on the rapid construction of deep, large-scale, constraint-based, multilingual grammatical resources. I present an application-oriented LFG analysis for Chinese core linguistic phenomena an...
متن کاملAutomatic Acquisition of Lexical-Functional Grammar Resources from a Japanese Dependency Corpus
This paper describes a method for automatic acquisition of wide-coverage treebank-based deep linguistic resources for Japanese, as part of a project on treebankbased induction of multilingual resources in the framework of Lexical-Functional Grammar (LFG). We automatically annotate LFG f-structure functional equations (i.e. labelled dependencies) to the Kyoto Text Corpus version 4.0 (KTC4) (Kuro...
متن کاملThe Spanish DELPH-IN grammar
In this article we present a Spanish grammar implemented in the Linguistic Knowledge Builder system and grounded in the theoretical framework of Head-driven Phrase Structure Grammar. The grammar is being developed in an international multilingual context, the DELPH-IN Initiative, contributing to an open-source repository of software and linguistic resources for various Natural Language Processi...
متن کامل